X-SOM: A flexible ontology mapper
System interoperability is a well-known issue, especially for heterogeneous information systems, where ontology-based representations may support automatic and user-transparent integration. In this paper we present X-SOM, an ontology mapping and integration tool. The contribution of our tool is a modular and extensible architecture that automatically combines several matching techniques by means of a neural network, and also performs ontology debugging to avoid inconsistencies. Besides describing the tool components, we discuss the prototype implementation, which has been tested against the OAEI 2006 benchmark with promising results.
No Bits Left Behind
One of the key tenets of database system design is making efficient
use of storage and memory resources. However, existing database
system implementations are actually extremely wasteful of such
resources; for example, most systems leave a great deal of empty
space in tuples, index pages, and data pages, and spend many
CPU cycles reading cold records from disk that are never used.
In this paper, we identify a number of such sources of waste, and
present a series of techniques that limit this waste (e.g., forcing
better memory locality for hot data and using empty space in index
pages to cache popular tuples) without substantially complicating
interfaces or system design. We show that these techniques
effectively reduce memory requirements for real scenarios from
the Wikipedia database (by up to 17.8×) while increasing query
performance (by up to 8×).
Relational Cloud: The Case for a Database Service
In this paper, we make the case for “databases as a service” (DaaS), with two target scenarios in mind: (i) consolidation of data management functionality for large organizations and (ii) outsourcing data management to a cloud-based service provider for small/medium organizations. We analyze the many challenges to be faced, and discuss the design of a database service we are building, called Relational Cloud. The system has been designed from scratch and combines many recent advances and novel solutions. The prototype we present exploits multiple dedicated storage engines, provides high availability via transparent replication, supports automatic workload partitioning and live data migration, and provides serializable distributed transactions. While the system is still under active development, we are able to present promising initial results that showcase the key features of our system. The tests are based on TPC benchmarks and real-world data from epinions.com, and show our partitioning, scalability and balancing capabilities.
Schism: a Workload-Driven Approach to Database Replication and Partitioning
We present Schism, a novel workload-aware approach for database partitioning and replication designed to improve scalability of shared-nothing distributed databases. Because distributed transactions are expensive in OLTP settings (a fact we demonstrate through a series of experiments), our partitioner attempts to minimize the number of distributed transactions, while producing balanced partitions. Schism consists of two phases: i) a workload-driven, graph-based replication/partitioning phase and ii) an explanation and validation phase. The first phase creates a graph with a node per tuple (or group of tuples) and edges between nodes accessed by the same transaction, and then uses a graph partitioner to split the graph into k balanced partitions that minimize the number of cross-partition transactions. The second phase exploits machine learning techniques to find a predicate-based explanation of the partitioning strategy (i.e., a set of range predicates that represent the same replication/partitioning scheme produced by the partitioner).
The strengths of Schism are: i) independence from the schema layout, ii) effectiveness on n-to-n relations, typical in social network databases, iii) a unified and fine-grained approach to replication and partitioning. We implemented and tested a prototype of Schism on a wide spectrum of test cases, ranging from classical OLTP workloads (e.g., TPC-C and TPC-E), to more complex scenarios derived from social network websites (e.g., Epinions.com), whose schema contains multiple n-to-n relationships, which are known to be hard to partition. Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions by up to 30%.
Quanta Computer (Firm) (T-Party Project
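The first phase described in the Schism abstract (one node per tuple, edges between tuples accessed by the same transaction, then a balanced split that minimizes cross-partition transactions) can be illustrated on a toy workload. The sketch below is only an illustration under stated assumptions: the function names and the sample workload are hypothetical, and the exhaustive two-way search stands in for a real graph partitioner, which the abstract does not name.

```python
from collections import Counter
from itertools import combinations


def build_coaccess_graph(transactions):
    """Map each tuple pair to the number of transactions that touch both."""
    edges = Counter()
    for accessed in transactions:
        for a, b in combinations(sorted(set(accessed)), 2):
            edges[(a, b)] += 1
    return edges


def count_distributed(transactions, assignment):
    """Count transactions whose tuples live in more than one partition."""
    return sum(
        1 for accessed in transactions
        if len({assignment[t] for t in accessed}) > 1
    )


def best_balanced_bipartition(transactions):
    """Exhaustively search balanced 2-way splits (toy sizes only)."""
    nodes = sorted({t for accessed in transactions for t in accessed})
    half = len(nodes) // 2
    best = None
    for left in combinations(nodes, half):
        left_set = set(left)
        assignment = {n: (0 if n in left_set else 1) for n in nodes}
        cost = count_distributed(transactions, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Hypothetical workload: each inner list is the set of tuples one transaction reads or writes.
workload = [
    ["u1", "u2"],
    ["u1", "u2"],
    ["u3", "u4"],
    ["u3", "u4"],
    ["u2", "u3"],
]

graph = build_coaccess_graph(workload)
cost, assignment = best_balanced_bipartition(workload)
print(dict(graph))
print(cost, assignment)
```

On this workload the search keeps u1 with u2 and u3 with u4, so only the single transaction that touches both u2 and u3 remains distributed. Exhaustive search is exponential in the number of tuples, which is why a practical system would rely on a dedicated graph partitioner instead.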
Workload-Aware Database Monitoring and Consolidation
In most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization of less than 4%. This unused capacity can be potentially harnessed to consolidate multiple databases on fewer machines, reducing hardware and operational costs. Virtual machine (VM) technology is one popular way to approach this problem. However, as we demonstrate in this paper, VMs fail to adequately support database consolidation, because databases place a unique and challenging set of demands on hardware resources, which are not well-suited to the assumptions made by VM-based consolidation.
Instead, our system for database consolidation, named Kairos, uses novel techniques to measure the hardware requirements of database workloads, as well as models to predict the combined resource utilization of those workloads. We formalize the consolidation problem as a non-linear optimization program, aiming to minimize the number of servers and balance load, while achieving near-zero performance degradation. We compare Kairos against virtual machines, showing up to 12× higher throughput on a TPC-C-like benchmark. We also tested the effectiveness of our approach on real-world data collected from production servers at Wikia.com, Wikipedia, Second Life, and MIT CSAIL, showing absolute consolidation ratios ranging between 5.5:1 and 17:1.
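To make the consolidation goal concrete, the sketch below packs database workloads onto as few servers as possible. It is not the non-linear optimization program the Kairos abstract describes: it swaps in a simple greedy first-fit-decreasing heuristic, and it assumes that CPU and RAM demands add up linearly when workloads share a server, which the abstract does not claim. All names and numbers are hypothetical.

```python
def consolidate(workloads, cpu_capacity, ram_capacity):
    """Greedy first-fit-decreasing packing of workloads onto servers.

    workloads maps a database name to (cpu_percent, ram_gb).
    Assumption: demands of co-located workloads simply add up.
    """
    servers = []  # each entry: {"cpu": used, "ram": used, "dbs": [names]}
    # Place the most demanding workloads first (sorted by cpu, then ram).
    order = sorted(workloads.items(), key=lambda kv: kv[1], reverse=True)
    for name, (cpu, ram) in order:
        if cpu > cpu_capacity or ram > ram_capacity:
            raise ValueError(f"{name} does not fit on any single server")
        for server in servers:
            fits_cpu = server["cpu"] + cpu <= cpu_capacity
            fits_ram = server["ram"] + ram <= ram_capacity
            if fits_cpu and fits_ram:
                server["cpu"] += cpu
                server["ram"] += ram
                server["dbs"].append(name)
                break
        else:
            # No existing server has room, so open a new one.
            servers.append({"cpu": cpu, "ram": ram, "dbs": [name]})
    return servers


# Hypothetical measurements: (CPU percent of one server, RAM in GB).
workloads = {
    "wiki": (30, 8),
    "forum": (10, 4),
    "crm": (5, 2),
    "logs": (60, 6),
}

servers = consolidate(workloads, cpu_capacity=100, ram_capacity=16)
ratio = len(workloads) / len(servers)
for server in servers:
    print(server)
print(f"consolidation ratio {ratio}:1")
```

Here four dedicated servers collapse to two. A greedy heuristic gives no optimality guarantee and ignores load balancing, both of which the optimization formulation in the abstract targets explicitly.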
ERA: A Framework for Economic Resource Allocation for the Cloud
Cloud computing has reached significant maturity from a systems perspective,
but currently deployed solutions rely on rather basic economics mechanisms that
yield suboptimal allocation of the costly hardware resources. In this paper we
present Economic Resource Allocation (ERA), a complete framework for scheduling
and pricing cloud resources, aimed at increasing the efficiency of cloud
resources usage by allocating resources according to economic principles. The
ERA architecture carefully abstracts the underlying cloud infrastructure,
enabling the development of scheduling and pricing algorithms independently of
the concrete lower-level cloud infrastructure and independently of its
concerns. Specifically, ERA is designed as a flexible layer that can sit on top
of any cloud system and interfaces with both the cloud resource manager and
with the users who reserve resources to run their jobs. The jobs are scheduled
based on prices that are dynamically calculated according to the predicted
demand. Additionally, ERA provides a key internal API to pluggable algorithmic
modules that include scheduling, pricing and demand prediction. We provide a
proof-of-concept software and demonstrate the effectiveness of the architecture
by testing ERA over both public and private cloud systems -- Azure Batch of
Microsoft and Hadoop/YARN. A broader intent of our work is to foster
collaborations between economics and system communities. To that end, we have
developed a simulation platform via which economics and system experts can test
their algorithmic implementations.
Kaskade: Graph Views for Efficient Graph Analytics
Graphs are an increasingly popular way to model real-world entities and
relationships between them, ranging from social networks to data lineage graphs
and biological datasets. Queries over these large graphs often involve
expensive subgraph traversals and complex analytical computations. These
real-world graphs are often substantially more structured than a generic
vertex-and-edge model would suggest, but this insight has remained mostly
unexplored by existing graph engines for graph query optimization purposes.
Therefore, in this work, we focus on leveraging structural properties of graphs
and queries to automatically derive materialized graph views that can
dramatically speed up query evaluation. We present KASKADE, the first graph
query optimization framework to exploit materialized graph views for query
optimization purposes. KASKADE employs a novel constraint-based view
enumeration technique that mines constraints from query workloads and graph
schemas, and injects them during view enumeration to significantly reduce the
search space of views to be considered. Moreover, it introduces a graph view
size estimator to pick the most beneficial views to materialize given a query
set and to select the best query evaluation plan given a set of materialized
views. We evaluate its performance over real-world graphs, including the
provenance graph that we maintain at Microsoft to enable auditing, service
analytics, and advanced system optimizations. Our results show that KASKADE
substantially reduces the effective graph size and yields significant
performance speedups (up to 50X), in some cases making otherwise intractable
queries possible.
LST-Bench: Benchmarking Log-Structured Tables in the Cloud
Log-Structured Tables (LSTs), also commonly referred to as table formats,
have recently emerged to bring consistency and isolation to object stores. With
the separation of compute and storage, object stores have become the go-to for
highly scalable and durable storage. However, this comes with its own set of
challenges, such as the lack of recovery and concurrency management that
traditional database management systems provide. This is where LSTs such as
Delta Lake, Apache Iceberg, and Apache Hudi come into play, providing an
automatic metadata layer that manages tables defined over object stores,
effectively addressing these challenges. A paradigm shift in the design of
these systems necessitates the updating of evaluation methodologies. In this
paper, we examine the characteristics of LSTs and propose extensions to
existing benchmarks, including workload patterns and metrics, to accurately
capture their performance. We introduce our framework, LST-Bench, which enables
users to execute benchmarks tailored for the evaluation of LSTs. Our evaluation
demonstrates how these benchmarks can be utilized to evaluate the performance,
efficiency, and stability of LSTs. The code for LST-Bench is open sourced and
is available at https://github.com/microsoft/lst-bench/